Skip to content

Conversation

@astefanutti
Copy link
Contributor

@astefanutti astefanutti commented Jul 30, 2025

What this PR does / why we need it:

This enables users to pass environment variables to the custom trainer.

/assign @kubeflow/kubeflow-trainer-team @szaher @kramaranya @eoinfennessy @briangallagher

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented Jul 30, 2025

Pull Request Test Coverage Report for Build 16677180921

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 2 of 2 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.2%) to 65.679%

Totals Coverage Status
Change from base Build 16655049250: 0.2%
Covered Lines: 266
Relevant Lines: 405

💛 - Coveralls

Copy link
Member

@eoinfennessy eoinfennessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@google-oss-prow
Copy link

@eoinfennessy: changing LGTM is restricted to collaborators

In response to this:

LGTM!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @astefanutti!
I have general question, when do you think we should recommend users to use env or func_args ?
I can imagine that system env should be configured by Platform Admins.

pip_index_url (`Optional[str]`): The PyPI URL from which to install Python packages.
num_nodes (`Optional[int]`): The number of nodes to use for training.
resources_per_node (`Optional[Dict]`): The computing resources to allocate per node.
env (`Optional[Dict[str, str]]`): Environment variables to set in the training containers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be more explicit, since SDK users should not be aware of containers ?

Suggested change
env (`Optional[Dict[str, str]]`): Environment variables to set in the training containers.
env (`Optional[Dict[str, str]]`): The environment variables to set in the training nodes.

Comment on lines 414 to 423
if trainer.env:
env_vars = []
for key, value in trainer.env.items():
env_vars.append(
models.IoK8sApiCoreV1EnvVar(
name=key,
value=value
)
)
trainer_crd.env = env_vars
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like this:

Suggested change
if trainer.env:
env_vars = []
for key, value in trainer.env.items():
env_vars.append(
models.IoK8sApiCoreV1EnvVar(
name=key,
value=value
)
)
trainer_crd.env = env_vars
if trainer.env:
trainer_crd.env = [
models.IoK8sApiCoreV1EnvVar(name=key, value=value)
for key, value in trainer.env.items()
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@kramaranya kramaranya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you @astefanutti!

Comment on lines 286 to 288
if add_built_in_trainer:
train_job = add_built_in_trainer_to_job(train_job)
if add_custom_trainer:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if add_built_in_trainer:
train_job = add_built_in_trainer_to_job(train_job)
if add_custom_trainer:
if add_built_in_trainer:
train_job = add_built_in_trainer_to_job(train_job)
elif add_custom_trainer:

Comment on lines 414 to 423
if trainer.env:
env_vars = []
for key, value in trainer.env.items():
env_vars.append(
models.IoK8sApiCoreV1EnvVar(
name=key,
value=value
)
)
trainer_crd.env = env_vars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
@astefanutti
Copy link
Contributor Author

@andreyvelich @kramaranya thanks a lot for the review.

I should have addressed all your comments.

@kramaranya
Copy link
Contributor

Thank you for this, @astefanutti!
/lgtm

@astefanutti
Copy link
Contributor Author

astefanutti commented Aug 1, 2025

Thanks for this @astefanutti! I have general question, when do you think we should recommend users to use env or func_args ? I can imagine that system env should be configured by Platform Admins.

I would say this is convenient and useful for AI practitioners to be able to pass environment variables like HF_TOKEN, PYTHON_*, NCCL_* ...

On the other hand func_args is more about hyper-parameters and the training configuration, while environment variables is a universal mechanism that I'm sure most AI practitioners are already familiar with to configure runtime components.

At least for custom trainers, I would be inclined to think users would prefer being able to pass these variables directly than asking platform admins to configure them each time on training runtimes.

@google-oss-prow google-oss-prow bot removed the lgtm label Aug 1, 2025
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@astefanutti Thanks for creating this! Overall LGTM!

/lgtm

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @astefanutti!
/lgtm
/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit cc4cbe0 into kubeflow:main Aug 1, 2025
8 checks passed
@google-oss-prow google-oss-prow bot added this to the v0.1 milestone Aug 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants